Distributional Analysis¶

Distributional analysis is a term I coined for a very simple yet powerful way of analyzing datasets. It means that you think of the dataset as a distribution within a large multidimensional space, which you then can examine through its marginal statistics in any two-dimensional subspace.

This is my favorite tool for exploratory data analysis—meaning you don't know what you're looking for just yet, you're simply getting to know the data. The best way to understand this is through examples. So let's turn our attention to one of my favorite datasets, the global surface drifter dataset.

Surface instruments record latitude and longitude—and therefore horizontal velocity—as well as, sometimes, temperature as they drift following the ocean currents. We're using this classic dataset:

Lumpkin, Rick; Centurioni, Luca (2019). Global Drifter Program quality-controlled 6-hour interpolated data from ocean surface drifting buoys. NOAA National Centers for Environmental Information. Dataset. https://doi.org/10.25921/7ntx-z961.

If you want to run this lab yourself, as opposed to simply viewing it, you'll need to grab the single NetCDF file version of this dataset. That file is currently called gdp_jul22_ragged_6h.nc and can be downloaded from here.

In [ ]:
#install needed packages using 
#conda install  -c conda-forge numpy matplotlib xarray dask netCDF4 bottleneck scipy pandas cartopy ipython ipympl nodejs cmasher jupyterlab git pip seaborn
#or its equivalent 

import time
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
from scipy import stats
import cartopy.crs as ccrs
import cartopy.feature as cfeature

import warnings  #ignore warnings for readability
warnings.filterwarnings('ignore')
In [ ]:
#load in hourly drifter dataset

#timing this script 

tic = time.time()

datadir = '/Users/lilly/Desktop/Dropbox/NetCDf/'  #path to the directory where you've put floats.nc
drifters = xr.open_dataset(datadir + 'gdp_jul22_ragged_6h.nc')
drifters
Out[ ]:
<xarray.Dataset>
Dimensions:                (traj: 26843, obs: 44544647)
Coordinates:
    ids                    (obs) int64 ...
    time                   (obs) datetime64[ns] ...
    lon                    (obs) float32 ...
    lat                    (obs) float32 ...
Dimensions without coordinates: traj, obs
Data variables: (12/44)
    ID                     (traj) int64 ...
    rowsize                (traj) int32 ...
    WMO                    (traj) int32 ...
    expno                  (traj) int32 ...
    deploy_date            (traj) datetime64[ns] ...
    deploy_lat             (traj) float32 ...
    ...                     ...
    vn                     (obs) float32 ...
    temp                   (obs) float32 ...
    err_lat                (obs) float32 ...
    err_lon                (obs) float32 ...
    err_temp               (obs) float32 ...
    drogue_status          (obs) bool ...
Attributes: (12/16)
    title:             Global Drifter Program six-hourly drifting buoy collec...
    history:           Last update July 2022.  Metadata from dirall.dat and d...
    Conventions:       CF-1.6
    date_created:      2022-12-08T18:44:27.784441
    publisher_name:    GDP Drifter DAC
    publisher_email:   aoml.dftr@noaa.gov
    ...                ...
    contributor_name:  NOAA Global Drifter Program
    contributor_role:  Data Acquisition Center
    institution:       NOAA Atlantic Oceanographic and Meteorological Laboratory
    acknowledgement:   Lumpkin, Rick; Centurioni, Luca (2019). NOAA Global Dr...
    summary:           Global Drifter Program six-hourly data
    doi:               10.25921/7ntx-z961
xarray.Dataset
    • traj: 26843
    • obs: 44544647
    • ids
      (obs)
      int64
      ...
      long_name :
      Global Drifter Program Buoy ID repeated along observations
      units :
      -
      [44544647 values with dtype=int64]
    • time
      (obs)
      datetime64[ns]
      ...
      long_name :
      Time
      [44544647 values with dtype=datetime64[ns]]
    • lon
      (obs)
      float32
      ...
      long_name :
      Longitude
      units :
      degrees_east
      [44544647 values with dtype=float32]
    • lat
      (obs)
      float32
      ...
      long_name :
      Latitude
      units :
      degrees_north
      [44544647 values with dtype=float32]
    • ID
      (traj)
      int64
      ...
      long_name :
      Global Drifter Program Buoy ID
      units :
      -
      [26843 values with dtype=int64]
    • rowsize
      (traj)
      int32
      ...
      long_name :
      Number of observations per trajectory
      sample_dimension :
      obs
      units :
      -
      [26843 values with dtype=int32]
    • WMO
      (traj)
      int32
      ...
      long_name :
      World Meteorological Organization buoy identification number
      units :
      -
      [26843 values with dtype=int32]
    • expno
      (traj)
      int32
      ...
      long_name :
      Experiment number
      units :
      -
      [26843 values with dtype=int32]
    • deploy_date
      (traj)
      datetime64[ns]
      ...
      long_name :
      Deployment date and time
      [26843 values with dtype=datetime64[ns]]
    • deploy_lat
      (traj)
      float32
      ...
      long_name :
      Deployment latitude
      units :
      degrees_north
      [26843 values with dtype=float32]
    • deploy_lon
      (traj)
      float32
      ...
      long_name :
      Deployment longitude
      units :
      degrees_east
      [26843 values with dtype=float32]
    • end_date
      (traj)
      datetime64[ns]
      ...
      long_name :
      End date and time
      [26843 values with dtype=datetime64[ns]]
    • end_lat
      (traj)
      float32
      ...
      long_name :
      End longitude
      units :
      degrees_east
      [26843 values with dtype=float32]
    • end_lon
      (traj)
      float32
      ...
      long_name :
      End latitude
      units :
      degrees_north
      [26843 values with dtype=float32]
    • drogue_lost_date
      (traj)
      datetime64[ns]
      ...
      long_name :
      Date and time of drogue loss
      [26843 values with dtype=datetime64[ns]]
    • typedeath
      (traj)
      int8
      ...
      long_name :
      Type of death
      units :
      -
      comments :
      0 (buoy still alive), 1 (buoy ran aground), 2 (picked up by vessel), 3 (stop transmitting), 4 (sporadic transmissions), 5 (bad batteries), 6 (inactive status)
      [26843 values with dtype=int8]
    • typebuoy
      (traj)
      |S10
      ...
      long_name :
      Buoy type (see https://www.aoml.noaa.gov/phod/dac/dirall.html)
      units :
      -
      [26843 values with dtype=|S10]
    • DeployingShip
      (traj)
      |S20
      ...
      long_name :
      Name of deployment ship
      units :
      -
      [26843 values with dtype=|S20]
    • DeploymentStatus
      (traj)
      |S20
      ...
      long_name :
      Deployment status
      units :
      -
      [26843 values with dtype=|S20]
    • BuoyTypeManufacturer
      (traj)
      |S20
      ...
      long_name :
      Buoy type manufacturer
      units :
      -
      [26843 values with dtype=|S20]
    • BuoyTypeSensorArray
      (traj)
      |S20
      ...
      long_name :
      Buoy type sensor array
      units :
      -
      [26843 values with dtype=|S20]
    • CurrentProgram
      (traj)
      float64
      ...
      long_name :
      Current Program
      units :
      -
      [26843 values with dtype=float64]
    • PurchaserFunding
      (traj)
      |S20
      ...
      long_name :
      Purchaser funding
      units :
      -
      [26843 values with dtype=|S20]
    • SensorUpgrade
      (traj)
      |S20
      ...
      long_name :
      Sensor upgrade
      units :
      -
      [26843 values with dtype=|S20]
    • Transmissions
      (traj)
      |S20
      ...
      long_name :
      Transmissions
      units :
      -
      [26843 values with dtype=|S20]
    • DeployingCountry
      (traj)
      |S20
      ...
      long_name :
      Deploying country
      units :
      -
      [26843 values with dtype=|S20]
    • DeploymentComments
      (traj)
      |S20
      ...
      long_name :
      Deployment comments
      units :
      -
      [26843 values with dtype=|S20]
    • ManufactureYear
      (traj)
      float32
      ...
      long_name :
      Manufacture year
      units :
      -
      [26843 values with dtype=float32]
    • ManufactureMonth
      (traj)
      float32
      ...
      long_name :
      Manufacture month
      units :
      -
      [26843 values with dtype=float32]
    • ManufactureSensorType
      (traj)
      |S20
      ...
      long_name :
      Manufacture Sensor Type
      units :
      -
      [26843 values with dtype=|S20]
    • ManufactureVoltage
      (traj)
      float32
      ...
      long_name :
      Manufacture voltage
      units :
      V
      [26843 values with dtype=float32]
    • FloatDiameter
      (traj)
      float64
      ...
      long_name :
      Diameter of surface floater
      units :
      cm
      [26843 values with dtype=float64]
    • SubsfcFloatPresence
      (traj)
      bool
      ...
      long_name :
      Subsurface Float Presence
      units :
      -
      [26843 values with dtype=bool]
    • DrogueType
      (traj)
      |S7
      ...
      drogue_type :
      Drogue Type
      units :
      -
      [26843 values with dtype=|S7]
    • DrogueLength
      (traj)
      float64
      ...
      long_name :
      Length of drogue.
      units :
      m
      [26843 values with dtype=float64]
    • DrogueBallast
      (traj)
      float64
      ...
      long_name :
      Weight of the drogue's ballast.
      units :
      kg
      [26843 values with dtype=float64]
    • DragAreaAboveDrogue
      (traj)
      float64
      ...
      long_name :
      Drag area above drogue.
      units :
      m^2
      [26843 values with dtype=float64]
    • DragAreaOfDrogue
      (traj)
      float64
      ...
      long_name :
      Drag area drogue.
      units :
      m^2
      [26843 values with dtype=float64]
    • DragAreaRatio
      (traj)
      float64
      ...
      long_name :
      Drag area ratio
      units :
      m
      [26843 values with dtype=float64]
    • DrogueCenterDepth
      (traj)
      float64
      ...
      long_name :
      Average depth of the drogue.
      units :
      m
      [26843 values with dtype=float64]
    • DrogueDetectSensor
      (traj)
      |S20
      ...
      long_name :
      Drogue detection sensor
      units :
      -
      [26843 values with dtype=|S20]
    • ve
      (obs)
      float32
      ...
      long_name :
      Eastward velocity
      units :
      m/s
      [44544647 values with dtype=float32]
    • vn
      (obs)
      float32
      ...
      long_name :
      Northward velocity
      units :
      m/s
      [44544647 values with dtype=float32]
    • temp
      (obs)
      float32
      ...
      long_name :
      Sea Surface Bulk Temperature
      units :
      degree_Celcius
      [44544647 values with dtype=float32]
    • err_lat
      (obs)
      float32
      ...
      long_name :
      Standard error in latitude
      units :
      degrees_north
      [44544647 values with dtype=float32]
    • err_lon
      (obs)
      float32
      ...
      long_name :
      Standard error in longitude
      units :
      degrees_east
      [44544647 values with dtype=float32]
    • err_temp
      (obs)
      float32
      ...
      long_name :
      Standard error in temperature
      units :
      degree_Celcius
      [44544647 values with dtype=float32]
    • drogue_status
      (obs)
      bool
      ...
      long_name :
      Status indicating the presence of the drogue
      units :
      -
      flag_values :
      1,0
      flag_meanings :
      drogued, undrogued
      [44544647 values with dtype=bool]
    • title :
      Global Drifter Program six-hourly drifting buoy collection
      history :
      Last update July 2022. Metadata from dirall.dat and deplog.dat
      Conventions :
      CF-1.6
      date_created :
      2022-12-08T18:44:27.784441
      publisher_name :
      GDP Drifter DAC
      publisher_email :
      aoml.dftr@noaa.gov
      publisher_url :
      https://www.aoml.noaa.gov/phod/gdp
      licence :
      freely available
      processing_level :
      Level 2 QC by GDP drifter DAC
      metadata_link :
      https://www.aoml.noaa.gov/phod/dac/dirall.html
      contributor_name :
      NOAA Global Drifter Program
      contributor_role :
      Data Acquisition Center
      institution :
      NOAA Atlantic Oceanographic and Meteorological Laboratory
      acknowledgement :
      Lumpkin, Rick; Centurioni, Luca (2019). NOAA Global Drifter Program quality-controlled 6-hour interpolated data from ocean surface drifting buoys. [indicate subset used]. NOAA National Centers for Environmental Information. Dataset. https://doi.org/10.25921/7ntx-z961. Accessed [date].
      summary :
      Global Drifter Program six-hourly data
      doi :
      10.25921/7ntx-z961
    In [ ]:
    # set up some need variables and default plotting properties
    
    lon = drifters.lon.values
    lat = drifters.lat.values 
    u = drifters.ve.values*100 #convert to cm/s
    v = drifters.vn.values*100 #convert to cm/s
    sst = drifters.temp.values
    tim = drifters.time.values
    
    #remove data points containing NaNs
    bool = ~np.isnan(u) & ~np.isnan(v) & ~np.isnan(lat) & ~np.isnan(lon)
    lat = lat[bool]
    lon = lon[bool]
    u = u[bool]
    v = v[bool]
    sst = sst[bool]
    tim = tim[bool]
    
    figsize=np.array([18, 12]);
    projection=ccrs.PlateCarree();
    #some other projections we might try
    #projection=ccrs.Robinson();
    #projection=ccrs.LambertCylindrical();
    
    #choose a default colormap
    cmap = plt.cm.get_cmap('Spectral_r', 64)
    #some other colormaps.  Note that the '_r' reverses the map
    #cmap = plt.cm.get_cmap('plasma_r', 40)
    #cmap = plt.cm.get_cmap('magma_r', 40)
    #cmap = plt.cm.get_cmap('inferno_r', 40)
    #cmap = plt.cm.get_cmap('viridis_r', 40)
    #cmap = plt.cm.get_cmap('RdYlBu_r', 64)
    
    #for future reference, define a function to set up our map
    def map_setup(figsize,projection):
         fig = plt.figure(figsize=figsize)
         ax = fig.add_subplot(projection=projection)
         ax.add_feature(cfeature.LAND, facecolor='grey')
         ax.gridlines(draw_labels=True)  
         ax.set_ylim(-80,80)
         return fig, ax
    

    Exploring the Drifter Dataset¶

    Let's make a basic plot of the drifter dataset. First we will loop over all of the trajectories in order to plot each trajectory with a different color.

    In [ ]:
    fig, ax = map_setup(figsize,projection) 
    
    #find the head (a) and tail (b) of each segment
    b = np.cumsum(drifters.rowsize)-1
    a = b-drifters.rowsize+1
    
    #in what follows, we need to work with the original data drifters.lat and drifters.lat, not lat and
    #lon created above to have NaNs removed, because drifters.rowsize applies to the original data
    
    #set lon to nan when we jump across the dateline, for plotting purposes
    lon_nojumps = drifters.lon.values
    lon_nojumps[np.abs(lon_nojumps-np.roll(lon_nojumps, -1))>10] = np.nan; #nan for lon jumps > 10 degrees
    
    #plot each float trajectory with its own color ... this takes a minute
    for n in range(0,np.size(drifters.ID.values)):
         #index into values belonging to the nth trajectory
         index = np.arange(a.item(n),b.item(n))  
         #note you can't use a[n],b[n] because those are length 1 arrays, not scalars
         if index.size > 0:
             plt.plot(lon_nojumps[index], drifters.lat.values[index],transform=ccrs.PlateCarree(),linewidth=0.5)
             #The 'transform = ccrs.PlateCarree()' specifies that the data are given as latitude and longitude
             #By default, the data is assumed to be in the same projection as that of the axis object
    

    Two-Dimensional Histograms¶

    However, this doesn't tell us much quantitatively. To go further we're going to start exploring this dataset as a distribution. The first step is to plot the two dimensional histogram of observation locations in latitude–longitude space.

    In [ ]:
    #In the code below, stats.binned_statistic_2d() is the central function, which we can use to compute 
    #a number of different statistics. First we are interested in the number of observations in each bin 
    #specified by statistic='count'.  The third argument, where observation values would normally go, can
    #be 'None' in such a case.
    
    fig, ax = map_setup(figsize,projection) 
    
    dlatlon = 0.25
    lonbins = np.arange(-180, 180, dlatlon)
    latbins = np.arange(-80, 80, dlatlon)
    
    hist = stats.binned_statistic_2d(lat, lon, None, bins=[latbins, lonbins], statistic='count')  # Returns a object
    hist.statistic[hist.statistic == 0] = np.nan  # Sets all values of 0 to nan as log10(0) = -inf
    
    #In using stats.binned_statistic_2d, the input argument needs to be a 1-D array.  If you're using higher-
    #dimensional data, such as model data, then you would want to flatten the input arrays with np.ravel(). 
    #Similarly the input arguments should contain no NaNs.  If they do, you would want to remove then with
    #bool = ~np.isnan(x) & ~np.isnan(y)
    #hist = stats.binned_statistic_2d(x[bool], y[bool], None, bins=[xbins,ybins], statistic='mean').statistic  
    
    image = plt.pcolormesh(lonbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat', \
                           transform=ccrs.PlateCarree()) 
    plt.title('Distribution of Six-Hourly Data Points')
    plt.clim(1,2.75)
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, \
                 label='Log10 Number of Observations');   
    

    We see that the observation density is high in the central gyres, particularly in the North Atlantic, and is low in the Arctic, Antarctic, upwelling zones, and shallow seas.

    Let's take a look now at the distribution in time. For this, it will be useful to use latitude as one axis and time as the other.

    In [ ]:
    #distribution of data points over latitude and time
    
    #is this really the easiest way to do this stuff?
    yr=tim.astype('datetime64[Y]').astype(int)+1970  
    dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
    #dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
    
    fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
    plt.sca(ax[0]) 
    
    xbins = np.arange(1978, 2023)
    hist = stats.binned_statistic_2d(lat, yr, None, bins=[latbins, xbins], statistic='count')  # Returns a object
    hist.statistic[hist.statistic == 0] = np.nan  # Sets all values of 0 to nan as log10(0) = -inf
    
    image = plt.pcolormesh(xbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat') 
    plt.title('Data Distribution Over Year and Latitude')
    
    ax[0].set_xlabel('Year')
    ax[0].set_ylabel('Latitude')
    
    plt.sca(ax[1]) 
    
    xbins = np.arange(0, 366, 5)
    hist = stats.binned_statistic_2d(lat, dy, None, bins=[latbins, xbins], statistic='count')  # Returns a object
    hist.statistic[hist.statistic == 0] = np.nan  # Sets all values of 0 to nan as log10(0) = -inf
    
    image = plt.pcolormesh(xbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat') 
    plt.title('Data Distribution Over Yearday and Latitude')
    ax[1].set_xlabel('Day of Year')
    plt.yticks([])
    
    plt.subplots_adjust(wspace=0.025)
    fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Log10 Number of Data Points'); 
    fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Log10 Number of Data Points');
    

    We see that there is much interannual variability of the data distribution, but little annual variability. This is good news, because it means aliasing of the annual cycle, which we expect to be stronger than the interannual variability in this case, won't impact the mean statistics.

    Two-Dimensional Means and Standard Deviations¶

    It's very useful to also examine basic statistics as a function of two parameters. The most obvious choices for the axes of such two-dimensional statistics in this case would again be latitude and longitude.

    In [ ]:
    #in working with binned_statistic_2d, we need to make sure we have removed NaNs, which 
    #we did just after we loaded in the data
    
    fig, ax = map_setup(figsize,projection) 
    
    ubar = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='mean')  
    vbar = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='mean')  
    meanflow = np.sqrt(ubar.statistic ** 2 + vbar.statistic ** 2)
    
    image = plt.pcolormesh(lonbins, latbins, meanflow, cmap=cmap, shading='flat',transform=ccrs.PlateCarree()) 
    plt.title('Speed of Mean Surface Currents')
    plt.clim(0, 80) 
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Speed (cm/s)'); 
    

    We see the major current systems as thin ribbons: the Gulf Stream, the Kuroshio, the Brazil Current, the Agulhas Current, the equatorial current system, and so forth.

    A slightly different quantity is the mean of the current speed, rather than the speed or magnitude of the mean flow.

    The mean speed paints the currents as slightly broader ribbons, because now the magnitude is not decreased by the variability of a current's direction.

    A measure of the variability is the velocity standard deviation, which we look at next.

    In [ ]:
    ustd = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='std').statistic
    vstd = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='std').statistic 
    
    std = np.sqrt(ustd ** 2 + vstd ** 2)
    std[std == 0] = np.nan #set zero values of standard deviation to nans
    
    fig, ax = map_setup(figsize,projection) 
    
    image = plt.pcolormesh(lonbins, latbins, std, cmap=cmap, shading='flat', transform=ccrs.PlateCarree()) 
    plt.title('Standard Deviation of Surface Currents')
    plt.clim(0, 80)
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Standard Deviation (cm/s)');
    

    To see how these compare we can examine the coefficient of variation, defined as the ratio of the standard devation to the mean. This is a nondimensional quantity showing the relative magnitude of variability. The larger the coefficient of variation, the larger typical deviations from the mean are with respect to the mean itself.

    Note that the coefficient of variation is only meaningful when there is a true zero, that is for quantities like speed, height, amount of precipitation or radiation, etc. Such data is called ratio data in statistics. One would not use it for, say, temperature, because while 0 K is well defined, it is not a meaningful reference zero for typical planetary processes.

    For plotting the coefficient of variation, it is useful to take the log, as this has the nice property that log(1/a)=-log(a).

    In [ ]:
    fig, ax = map_setup(figsize,projection) 
    
    image = plt.pcolormesh(lonbins, latbins, np.log2(np.divide(std,meanflow)), \
                           cmap=plt.cm.get_cmap('BrBG_r', 64), shading='flat', transform=ccrs.PlateCarree()) 
    plt.title('Coefficent of Variation (Ratio of Standard Deviation to Mean)')
    plt.clim(-3,3)
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Log2 Coefficent of Variation (nondimensional)');
    

    Let's change now to another quantity, the sea surface temperature.

    However, note that the number of data points involved for which the temperature is measured is considerably smaller than the total. Thus, if we were going to look further at this, we would also want to redo the data distributions we presented earlier, showing only the number of data points for which sea surface temperature is not NaN.

    For expediency, we will skip this step and go straight to the two-dimensional statistics.

    In [ ]:
    bool = ~np.isnan(sst) 
    
    fig, ax = map_setup(figsize,projection) 
    
    sstbar = stats.binned_statistic_2d(lat[bool], lon[bool], sst[bool], bins=[latbins, lonbins], statistic='mean').statistic  
    image = plt.pcolormesh(lonbins, latbins, sstbar, cmap=cmap, shading='flat', transform=ccrs.PlateCarree()) 
    plt.title('Mean Sea Surface Temperature')
    plt.clim(0, 30) 
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature (deg C)'); 
    

    Note the multi-scale structure present here: global, regional, and small-scale or mesoscale.

    The large-scale temperature gradient is dominating the picture, obscuring the other variability. To account for this, we can subtract the longitudinally-averaged temperature at each latitude. This forms the temperature deviation from the zonal average.

    In [ ]:
    #temperature deviation from zonal average
    fig, ax = map_setup(figsize,projection) 
    
    #If anyone knows an easier way to tile an array into a matrix, let me know!
    sstbarlat=np.repeat(np.atleast_2d(np.nanmean(sstbar, axis=1)).T,np.size(sstbar,axis=1),axis=1)
    image = plt.pcolormesh(lonbins, latbins, sstbar-sstbarlat, cmap=cmap, shading='flat', transform=ccrs.PlateCarree()) 
    plt.title('Temperature Deviation from Zonal Average')
    plt.clim(-7, 7) 
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature (deg C)'); 
    

    We see that subtracting the zonal mean reveals smaller-scale structures. We now see more clearly that generally speaking, it's colder on the eastern side than on the western sides of ocean basins.

    Finally we look at the sea surface temperature standard deviation.

    In [ ]:
    #sea surface temperature standard deviation
    bool = ~np.isnan(sst) 
    
    fig, ax = map_setup(figsize,projection) 
    
    sststd = stats.binned_statistic_2d(lat[bool], lon[bool], sst[bool], bins=[latbins, lonbins], statistic='std').statistic
    
    image = plt.pcolormesh(lonbins, latbins, sststd, cmap=cmap, shading='flat', transform=ccrs.PlateCarree()) 
    plt.title('Sea Surface Temperature Standard Deviation')
    plt.clim(0, 6) 
    
    fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature Standard Deviation (deg C)'); 
    

    This shows a remarkably strong asymmetry between the northern and southern hemispheres.

    Regions of the highest mean sea surface temperature—such as the tropical warm pools—and also regions of the lowest mean sea surface temperature—the polar regions— are both regions of notably low temperature variability.

    The regions of highest tempertature variability are the eddying or high velocity variance regions of the northern subtropical western boundary currents, specifically the Gulf Stream and Kuroshio, as well as relatively shallow seas such as the Sea of Okhotsk and the Black Sea.

    Distributions and Statistics in Non-Spatial Dimensions¶

    For a while now we been looking at the statistics in latitude–longitude space. However, there are many other perspectives from which we could view this data. Let's start by returning to temperature and trying another way to examine the large-scale gradient.

    In [ ]:
    #mean sea surface temperature, with latitudinal profile
    
    bool = ~np.isnan(sst) 
    sstbar = stats.binned_statistic_2d(lat[bool], lon[bool], sst[bool], bins=[latbins, lonbins], statistic='mean').statistic  
    
    xbins=np.arange(-3, 32,.1)
    latmid=latbins[0:-1]+1/2*(latbins[1]-latbins[0])
    latmat=np.repeat(np.atleast_2d(latmid).T,np.size(sstbar,axis=1),axis=1)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic  
    
    fig, ax = plt.subplots(1, 2, width_ratios=[3, 1],figsize=np.array([18, 10]),sharey=True)
    plt.sca(ax[0]) 
    
    image = plt.pcolormesh(lonbins, latbins, sstbar, cmap=cmap, shading='flat') 
    plt.title('Mean Sea Surface Temperature')
    plt.clim(0, 30) 
    #ax[0].set_aspect(1.25)
    
    #It seems setting the aspect ratio of one plot breaks sharey, so I have to manually set the aspect ratio of the other
    plt.sca(ax[1]) 
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('SST Latitudinal Profile')
    plt.clim(0,1.85) 
    ax[1].set_aspect(0.364)
    
    ax[1].yaxis.tick_right()
    ax[1].set_xlabel('Mean SST at Each Gridpoint (deg C)')
    plt.subplots_adjust(wspace=0.025)
    
    fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Temperature (deg C)'); 
    fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=30/3, pad=0.08, label='Log10 Number of Grid Points'); 
    

    The right-hand panel now has latitude as its y-axis, but its x-axis is the mean temperature at each gridpoint. We can think of this as rotating the data distribution: rather than by looking at it from the 'top', with latitidue and longitude as the axes and the temperature flattened, we turn it to the side such that longitude disappears, but temperature appears as an explicit dimension.

    The right-hand panel now summarizes the distribution along the longitude axis from the left-hand panel. The different 'stripes' in the latitude–temperature distribution likely represent the different ocean basins, a hypothesis one could readily verify.

    Let's continue with this approach and look at latitudinal profiles of number of different quantities.

    In [ ]:
    #computations for latitudinal profiles 
    
    bool = ~np.isnan(sst) 
    sstbar = stats.binned_statistic_2d(lat[bool], lon[bool], sst[bool], bins=[latbins, lonbins], statistic='mean').statistic  
    sststd = stats.binned_statistic_2d(lat[bool], lon[bool], sst[bool], bins=[latbins, lonbins], statistic='std').statistic  
    spdbar = stats.binned_statistic_2d(lat, lon, np.abs(u+1j*v), bins=[latbins, lonbins], statistic='mean').statistic  
    
    ubar = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='mean')  
    vbar = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='mean')  
    meanflow = np.sqrt(ubar.statistic ** 2 + vbar.statistic ** 2)
    
    ustd = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='std').statistic
    vstd = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='std').statistic 
    
    std = np.sqrt(ustd ** 2 + vstd ** 2)
    std[std == 0] = np.nan #set zero values of standard deviation to nans
    
    latmid=latbins[0:-1]+1/2*(latbins[1]-latbins[0])
    latmat=np.repeat(np.atleast_2d(latmid).T,np.size(sstbar,axis=1),axis=1)
    
    In [ ]:
    #latitudinal profiles 
    
    fig, ax = plt.subplots(1, 4,figsize=np.array([16, 9]))
    plt.subplots_adjust(wspace=0.035)
    
    plt.sca(ax[0]) 
    xbins=np.arange(-3, 32,.1)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic  
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('Mean SST')
    plt.ylabel('Latitude')
    plt.xlabel('Temperature (deg C)')
    
    plt.sca(ax[1]) 
    xbins=np.arange(0,7,0.025)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(sststd)],sststd[~np.isnan(sststd)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic 
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('Temperature Standard Deviation')
    plt.xlabel('Temperature Standard Deviation (deg C)')
    
    plt.sca(ax[2]) 
    xbins=np.arange(0,60,1/3)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(meanflow)],meanflow[~np.isnan(meanflow)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic 
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('Speed of Mean Flow')
    plt.xlabel('Velocity Magnitude (cm/s)')
    
    plt.sca(ax[3]) 
    xbins=np.arange(0,60,1/3)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(std)],std[~np.isnan(std)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic 
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('Velocity Standard Deviation')
    plt.xlabel('Velocity Standard Deviation (cm/s)')
    
    #all colorbars are the same, so we can loop over like so
    for n in range(4):
        plt.sca(ax[n]) 
        plt.clim(0,1.85) 
        fig.colorbar(image, ax=ax[n], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Log10 Number of Grid Points'); 
        if n>0:
            plt.yticks([])
    

    These show the latitude distributions corresonding to latitude–longitude plots we made earlier.

    One pattern which jumps out is that the temperature standard deviation is lowest at very low (equatorial) latitudes and very high negative latitudes, that is, where the temperature is warmest or coldest.

    We can also examine the joint structure of these variables by taking the mean value of a third quantity. Some of these are interesting and some are less interesting. An interesting one is the joint structure of mean temperature and current speed.

    In [ ]:
    #joint structure of temperature and velocity 
    
    fig, ax = plt.subplots(1, 2,figsize=np.array([16, 10]))
    plt.subplots_adjust(wspace=0.025)
    
    plt.sca(ax[0]) 
    xbins=np.arange(-3, 32,.1)
    mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
                                    bins=[latbins, xbins], statistic='count').statistic  
    image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat') 
    plt.title('Distribution of Mean SST')
    plt.clim(0,1.85) 
    plt.ylabel('Latitude')
    plt.xlabel('Temperature (deg C)')
    fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Log10 Number of Grid Points'); 
    
    plt.sca(ax[1]) 
    mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], meanflow[~np.isnan(sstbar)], \
                                    bins=[latbins, xbins], statistic='mean').statistic 
    image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat') 
    plt.title('Mean Speed in Temperature / Latitude Bins')
    plt.clim(0,40) 
    plt.xlabel('Temperature Standard Deviation (deg C)')
    plt.yticks([])
    fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Velocity Magnitude (cm/s)');
    

    The left panel again shows the distribution of grid points in bins on the latitude / mean SST plane, while the right panel shows the mean temperature in bins on that same plane.

    Note that the at midlatitudes, the warmest grid points—those on the far right-hand edge of the distribution—occur in the fastest currents. These are the strong, but spatially compact western boundary currents advecting warmer water poleward.

    Finally, we can use this same approach to look at the annual cycle. We will now examine the mean current speed, and mean temperature, in bins on the latitude / day of year plane, as well as the mean values of their deviations from the zonal average.

    In [ ]:
    #annual cycle of current speed
    
    dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
    #dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
    
    fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
    plt.subplots_adjust(wspace=0.035)
    xbins=np.arange(0, 366, 5)
    
    plt.sca(ax[0]) 
    mat = stats.binned_statistic_2d(lat, dy, np.abs(u+1j*v), bins=[latbins, xbins], statistic='mean').statistic 
    image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat') 
    plt.title('Current Speed Annual Cycle')
    ax[0].set_xlabel('Yearday')
    ax[0].set_ylabel('Latitude')
    plt.clim(0,60) 
    fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Velocity (cm/s)'); 
    
    plt.sca(ax[1]) 
    mat = mat - np.repeat(np.atleast_2d(np.nanmean(mat, axis=1)).T,np.size(mat,axis=1),axis=1)
    image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat') 
    plt.title('Zonal Deviation of Current Speed Annual Cycle')
    ax[0].set_xlabel('Yearday')
    ax[0].set_ylabel('Latitude')
    plt.clim(-12,12) 
    plt.yticks([])
    fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1,  label='Velocity (cm/s)'); 
    
    In [ ]:
    #annual cycle of temperature
    
    dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
    #dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
    bool = ~np.isnan(sst) 
    
    fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
    plt.subplots_adjust(wspace=0.035)
    xbins=np.arange(0, 366, 5)
    
    plt.sca(ax[0]) 
    mat = stats.binned_statistic_2d(lat[bool], dy[bool], sst[bool], bins=[latbins, xbins], statistic='mean').statistic 
    image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat') 
    plt.title('Sea Surface Temperature Annual Cycle')
    ax[0].set_xlabel('Yearday')
    ax[0].set_ylabel('Latitude')
    plt.clim(0,31) 
    fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Temperature (deg C)'); 
    
    plt.sca(ax[1]) 
    mat = mat - np.repeat(np.atleast_2d(np.nanmean(mat, axis=1)).T,np.size(mat,axis=1),axis=1)
    image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat') 
    plt.title('SST Zonal Deviation Annual Cycle')
    ax[0].set_xlabel('Yearday')
    ax[0].set_ylabel('Latitude')
    plt.clim(-6,6) 
    plt.yticks([])
    fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Temperature (deg C)'); 
    

    From these two figures, we see that the current speed presents only a minor annual cycle, largely limited to the equatorial band, whereas the temperature unsurprisingly shows a very large annual cycle.

    This exploratory analysis has given us an overview of the main features apparent in the data. At this point, we would begin identifying features worthy of further study and / or forming hypotheses to investigate in more detail.

    In [ ]:
    print((time.time() - tic)/60)
    
    3.5829830010732016